Abstract

The Gaussian process latent variable model (GP-LVM) is a popular approach to non-linear
probabilistic dimensionality reduction. One design choice for the model is the number of
latent variables. We present a spike and slab prior for the GP-LVM and propose an efficient
variational inference procedure that gives a lower bound of the log marginal likelihood.
The new model provides a more principled approach for selecting latent dimensions than the
standard way of thresholding the length-scale parameters. The effectiveness of our approach
is demonstrated through experiments on real and simulated data. Further, we extend
multi-view Gaussian processes that rely on sharing latent dimensions (known as manifold
relevance determination) with spike and slab priors. This allows a more principled approach
for selecting a subset of the latent space for each view of data. The extended model
outperforms the previous state-of-the-art when applied to a cross-modal multimedia
retrieval task.


1. Introduction 

Gaussian process latent variable models (GP-LVM) reduce the dimensionality of data by
establishing a mapping from a low dimensional latent space, X, to a high dimensional
observed space, Y, through a Gaussian process (GP) (Lawrence, 2005; Titsias & Lawrence,
2010). The non-parametric nature of GPs and the flexibility of non-linear kernels enable
the GP-LVM to produce compact representations of data. The GP-LVM has been successfully
applied to various domains as a dimension reduction method. For example, Buettner & Theis
(2012) used the GP-LVM to resolve differences in single-cell gene expression patterns from
zygote to blastocyst, and Lu & Tang (2014) developed a discriminative GP-LVM for face
verification that was the first to surpass human-level performance.

When applying the GP-LVM to dimension reduction problems, a key choice is the
dimensionality of the latent space, Q. A larger latent dimensionality can correspond to a
significantly higher number of parameters, which potentially leads to overfitting. A
standard approach for choosing the latent dimensionality of the GP-LVM is to look at the
values of the length-scale parameters of the kernel function. These parameters characterise
the scales of individual latent dimensions. The underlying latent function can only vary
“slowly” along a latent dimension with a large length scale. When the length scale of a
latent dimension is significantly larger than those of the other dimensions, the influence
of this latent dimension on the overall (co)variances is negligible. Therefore, the number
of latent dimensions is conventionally determined from the length-scale parameters,
typically by comparing them to a manually chosen threshold. This closely relates to the
idea of automatic relevance determination (ARD) regression (Rasmussen & Williams, 2005),
where the ARD parameters are defined as one over the length scales. A limitation of this
approach is that the choice of threshold can involve a lot of hand tuning. Furthermore, for
non-linear kernels like the exponentiated quadratic, it is difficult to tell whether a
relatively large length scale means that the latent dimension is “slowly” switched off or
that the latent dimension is simply very smooth (Vehtari, 2001).
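To make the role of the length scales concrete, the following minimal sketch evaluates an exponentiated quadratic kernel with one length scale per input dimension. The function name `ard_eq_kernel` and the numerical values are illustrative assumptions, not part of the model definition; the point is only that a step along a dimension with a very large length scale leaves the covariance almost unchanged.

```python
import math


def ard_eq_kernel(x, x2, lengthscales, variance=1.0):
    """Exponentiated quadratic kernel with one length scale per dimension.

    Illustrative helper (hypothetical name): computes
    variance * exp(-0.5 * sum_q ((x_q - x2_q) / l_q)^2).
    """
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x, x2, lengthscales))
    return variance * math.exp(-0.5 * sq)


# Assumed values: dimension 1 has length scale 1, dimension 2 a very large one.
lengthscales = [1.0, 1000.0]
base = [0.0, 0.0]

# The same unit step taken along each dimension in turn.
k_move_dim1 = ard_eq_kernel(base, [1.0, 0.0], lengthscales)
k_move_dim2 = ard_eq_kernel(base, [0.0, 1.0], lengthscales)

print("step along dimension 1:", k_move_dim1)
print("step along dimension 2:", k_move_dim2)
```

The covariance drops noticeably for the step along the first dimension and stays close to the kernel variance for the second, which is why a large length scale is read as "this dimension is not used". How large is large enough is exactly the threshold that has to be tuned by hand.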

A more principled approach is preferable for automating the selection of the number of
latent dimensions: let the model decide which latent dimensions it really needs. Driven by
this idea, we introduce a spike and slab prior (Mitchell & Beauchamp, 1988; George &
McCulloch, 1993; West, 2003) for the latent variable X. A spike and slab prior contains
binary variables, which allow the model to probabilistically discard latent dimensions.
Given observed data Y, the posterior probability of a latent dimension being used can be
derived from the model definition. This offers a principled approach for selecting latent
dimensions. However, exact inference of the posterior distribution of the latent variables
is intractable, because there is no closed-form solution for the integral over the latent
variables in the log marginal likelihood. In this paper, we derive a closed-form
variational lower bound of the log marginal likelihood for the spike and slab GP-LVM and
develop an efficient inference method for the posterior distributions of the latent
representations X and the switch variables. In spike and slab models, standard mean field
approximations are problematic due to the strong correlation between switch variables and
input variables. Our variational approach assumes a conditional dependence between input
variables and switch variables, and is closely related to ideas like structured mean field
(Saul & Jordan, 1995; Xing et al., 2003). This allows us to efficiently infer the latent
representations of data while simultaneously determining the active latent dimensions.

In the literature, the spike and slab prior has been used for variable selection in various
regression models. For instance, Carbonetto & Stephens (2012) developed a variational
inference method with a spike and slab prior for Bayesian variable selection in linear
models. Savitsky et al. (2011) placed a spike and slab prior on the length-scale parameters
for variable selection in GP regression models, and proposed inference through a Markov
chain Monte Carlo (MCMC) scheme. In this paper, both the switch variables and the input
variables X are variationally integrated out. This enables us to infer latent
representations for dimension reduction. Note that this contrasts with regression, where
available inputs are only switched on or off. Here we are selecting both the number of
available inputs and their nature (through the latent variable approach). Spike and slab
priors have also been used in unsupervised learning for sparse linear models (i.e. sparse
coding) with variational or truncated approximations (Titsias & Lazaro-Gredilla, 2011;
Sheikh et al., 2014). The spike and slab GP-LVM we introduce is much more flexible because
it allows non-linear relationships to be encoded through an appropriate choice of Gaussian
process covariance function.

Through a principled formulation of the selection of latent dimensions, our efficient
variational approach allows us to extend multi-view learning with GPs with an explicit
separation of latent spaces for related views of a data set. The multi-view GP model, known
as manifold relevance determination (MRD) (Damianou et al., 2012), develops latent spaces
that are shared amongst the different views and latent spaces that are particular to each
given view. Its formulation can be distilled as a set of inter-related GP-LVM models which
share latent dimensions. Learning in the model consists of assigning each GP-LVM to a
separate subset of the latent dimensions through adjustment of length-scale parameters.
Therefore, by applying a GP-LVM to each view of the data, each view can “softly” decide
which latent dimensions to use by variation of the length-scale parameters. However, this
introduces the same ambiguity we referred to above: a threshold must be selected for
deciding when a latent dimension is being ignored. With the spike and slab prior, different
views can focus on a subset of the latent space by discretely switching off the unnecessary
dimensions. This provides a more principled approach for multi-view learning with GPs. Our
variational spike and slab GP-LVM is easily extended to handle this particular case.

Before introducing the new variational approximation, we first review the GP-LVM and its
Bayesian counterpart. We then introduce our spike and slab GP-LVM and its extension to MRD.
Finally, we empirically demonstrate the effectiveness of our model in selecting latent
dimensions with both synthetic and real data. We demonstrate the new multi-view approach on
an image-text dataset, on which it gives significantly better results than the previous
state-of-the-art.

2. Gaussian Process Latent Variable Model 

For unsupervised learning, we typically assume a set of observed data
$Y \in \mathbb{R}^{N \times D}$ with N datapoints and D dimensions for each datapoint. Our
aim is to obtain a latent representation of the observed data, which we denote by
$X \in \mathbb{R}^{N \times Q}$, where Q is the dimensionality of the latent space. In the
GP-LVM, the relationship between latent representations and data is given by a Gaussian
process

$$p(f_d | X) = \mathcal{N}(f_d; 0, K), \qquad (1)$$

and for simplicity we assume Gaussian noise,

$$p(y_d | f_d) = \mathcal{N}(y_d; f_d, \beta^{-1} I), \qquad (2)$$

where $y_d \in \mathbb{R}^N$ is the dth dimension of the observed data, $f_d$ is called the
noise-free observation, and K is the covariance matrix of X computed according to a kernel
function $k(x, x')$. By maximizing the log likelihood $\log p(Y | X)$, point estimates of
the latent representations X can be obtained (Lawrence, 2005). However, point estimation of
X implies fitting a large number of parameters, so the resulting model is prone to
overfitting. Titsias & Lawrence (2010) overcome this limitation by introducing a unit
Gaussian prior for X and deriving a variational lower bound of the log marginal likelihood
(known as the Bayesian GP-LVM). Their model uses the sparse GP formulation (Titsias, 2009),
which relies on augmenting the variable space with a set of inducing variables, u, so that
the model becomes:

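$$p(f_d | u_d, X, Z) = \mathcal{N}(f_d; K_{fu} K_{uu}^{-1} u_d,\; K_{ff} - K_{fu} K_{uu}^{-1} K_{fu}^\top), \qquad (3)$$

$$p(u_d | Z) = \mathcal{N}(u_d; 0, K_{uu}), \qquad (4)$$

$$p(X) = \prod_{n=1}^{N} \mathcal{N}(x_n; 0, I), \qquad (5)$$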

where $u_d$ is the inducing variable for the dth dimension, and Z is the inducing input.
$K_{ff}$ and $K_{uu}$ are the covariance matrices for X and Z respectively, and $K_{fu}$ is
the cross-covariance matrix between X and Z. Then, the marginal likelihood of the model is
defined as

$$p(Y) = \int \prod_{d=1}^{D} p(y_d | X)\, p(X)\, dX. \qquad (6)$$
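As an illustration of the generative process in (1), (2) and (5), the sketch below draws latent points from the unit Gaussian prior, samples each output dimension from a GP over those points, and adds Gaussian noise with precision β. It is a sketch under stated assumptions rather than the implementation used in this work: the isotropic kernel, the jitter value, the hand-written Cholesky routine and all function names are hypothetical choices made to keep the example self-contained.

```python
import math
import random


def eq_kernel(x, x2, variance=1.0, lengthscale=1.0):
    """Isotropic exponentiated quadratic kernel (assumed for illustration)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, x2))
    return variance * math.exp(-0.5 * d2 / lengthscale ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a small positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]
    return L


def sample_gplvm(N=15, Q=2, D=3, beta=100.0, seed=0, jitter=1e-6):
    """Draw (X, Y) from the GP-LVM generative model."""
    rng = random.Random(seed)
    # Eq. (5): each latent point has a unit Gaussian prior.
    X = [[rng.gauss(0.0, 1.0) for _ in range(Q)] for _ in range(N)]
    # Covariance matrix K of the latent points (jitter added for stability).
    K = [[eq_kernel(X[i], X[j]) + (jitter if i == j else 0.0)
          for j in range(N)] for i in range(N)]
    L = cholesky(K)
    Y = [[0.0] * D for _ in range(N)]
    for d in range(D):
        # Eq. (1): f_d ~ N(0, K), sampled as L @ eps with eps ~ N(0, I).
        eps = [rng.gauss(0.0, 1.0) for _ in range(N)]
        f = [sum(L[i][k] * eps[k] for k in range(i + 1)) for i in range(N)]
        # Eq. (2): y_d = f_d + noise with variance 1 / beta.
        for i in range(N):
            Y[i][d] = f[i] + rng.gauss(0.0, 1.0) / math.sqrt(beta)
    return X, Y


X, Y = sample_gplvm()
print(len(X), "latent points of dimension", len(X[0]))
print(len(Y), "observations of dimension", len(Y[0]))
```

Inference in the GP-LVM runs in the opposite direction: only Y is observed, and X has to be integrated out of (6), which is what makes the marginal likelihood intractable.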

For tractability, the integral over X in the log marginal likelihood is approximated
variationally. A variational posterior distribution of X is defined as:

$$q(X) = \prod_{n=1}^{N} \mathcal{N}(x_n; \mu_n, S_n). \qquad (7)$$


With the assumption $p(f_d | y_d, u_d, X) = p(f_d | u_d, X)$, a lower bound of the log
marginal likelihood can be obtained with Jensen’s inequality:

$$\log p(Y) \geq \sum_{d=1}^{D} F_d(q) - \mathrm{KL}(q(X) \| p(X)), \qquad (8)$$

$$F_d(q) = \log \left[ \frac{\beta^{N/2} |K_{uu}|^{1/2}}{(2\pi)^{N/2} |\beta \Psi_2 + K_{uu}|^{1/2}} e^{-\frac{1}{2} y_d^\top W y_d} \right] - \frac{\beta \psi_0}{2} + \frac{\beta}{2} \mathrm{Tr}(K_{uu}^{-1} \Psi_2), \qquad (9)$$

$$\mathrm{KL}(q(X) \| p(X)) = \int q(X) \log \frac{q(X)}{p(X)}\, dX, \qquad (10)$$

where $W = \beta I - \beta^2 \Psi_1 (\beta \Psi_2 + K_{uu})^{-1} \Psi_1^\top$. $\psi_0$,
$\Psi_1$ and $\Psi_2$ are the expectations of the covariance matrices w.r.t. $q(X)$, i.e.
$\psi_0 = \mathrm{Tr}(E_{q(X)}[K_{ff}])$, $\Psi_1 = E_{q(X)}[K_{fu}]$,
$\Psi_2 = E_{q(X)}[K_{fu}^\top K_{fu}]$. With this formulation, the latent variable X is
variationally integrated out, which leads to a closed-form lower bound of the log marginal
likelihood.

In this formulation, the selection of the latent dimensionality relies on parameters called
length scales, each of which separately scales one input dimension, e.g., $l_q$ in the
exponentiated quadratic kernel function
$k(x, x') = \sigma_f^2 \exp(-\frac{1}{2} \sum_{q=1}^{Q} (x_q - x'_q)^2 / l_q^2)$. If the
length scale of a latent dimension is significantly larger than those of the other latent
dimensions, the influence of this latent dimension on the overall (co)variances is
negligible. Therefore, a typical approach selects latent dimensions by thresholding their
length scales. In this work, we propose a more principled approach, in which the model
explicitly considers whether each latent dimension is used or not, by introducing a latent
switch variable $b \in \{0, 1\}^Q$ that consists of a set of binary variables, each
controlling the usage of one latent dimension.

3. Spike and Slab GP-LVM

We introduce a switch variable $b \in \{0, 1\}^Q$ that determines whether a particular
latent dimension is used or not. The switch variable controls the usage of a latent
dimension by replacing the input variable $x_n$ in the original Bayesian GP-LVM by
$x_n \circ b$, where $\circ$ denotes element-wise multiplication. If a binary variable in b
is zero, the input of the corresponding dimension to the underlying GP becomes zero. As the
input variable X has a Gaussian prior, the combination of X and b is known as a spike and
slab prior. The prior distribution of the latent switch variable b is governed by Bernoulli
distributions,

$$p(b) = \prod_{q=1}^{Q} \pi^{b_q} (1 - \pi)^{(1 - b_q)}, \qquad (11)$$

where $\pi$ is the prior probability, which typically takes the value 0.5. Due to the
introduction of b, all the $x_n$ in the original covariance matrices are replaced with
$b \circ x_n$. This changes the form of the cross-covariance matrix $K_{fu}$. In the case of
the exponentiated quadratic kernel it becomes

$$(K_{fu})_{nm} = \sigma_f^2 \exp\left( -\frac{1}{2} \sum_{q=1}^{Q} \frac{(b_q x_{nq} - z_{mq})^2}{l_q^2} \right), \qquad (12)$$


where $(\cdot)_{nm}$ indicates the nth row and mth column of the matrix, while $K_{uu}$ is
not affected. To make inference tractable, we take a variational approach, in which we
assume a variational posterior distribution for the spike and slab model. As the switch
variable and the slab variable are strongly correlated, the variational posterior is defined
as a conditional distribution,

$$q(b) = \prod_{q=1}^{Q} \gamma_q^{b_q} (1 - \gamma_q)^{(1 - b_q)}, \qquad (13)$$

$$q(x_{nq} | b_q = 1) = \mathcal{N}(x_{nq}; \mu_{nq}, s_{nq}), \qquad (14)$$

where $\gamma_q$ is the posterior probability of the qth dimension being used (the overall
probability of the slab part), and $\mu_{nq}$ and $s_{nq}$ are the mean and variance of the
variational posterior distribution for the slab part. This is closely related to ideas like
structured mean field (Saul & Jordan, 1995; Xing et al., 2003). The lower bound of the log
marginal likelihood therefore becomes

$$\log p(Y) \geq \sum_{d=1}^{D} F_d(q) - \mathrm{KL}(q(b, X) \| p(b) p(X)), \qquad (15)$$

$$\mathrm{KL}(q(b, X) \| p(b) p(X)) = \sum_{b} \int q(b, X) \log \frac{q(b, X)}{p(b) p(X)}\, dX, \qquad (16)$$


where $F_d(q)$ keeps the same form as in (9), while $\Psi_1$ and $\Psi_2$ need to be adapted
according to the new $K_{fu}$. The new $\Psi_1$ and $\Psi_2$ are defined as:

$$(\Psi_1)_{nm} = \sum_{b} q(b) \int k(b \circ x_n, z_m)\, q(x_n | b)\, dx_n, \qquad (17)$$

$$(\Psi_2)_{mm'} = \sum_{n=1}^{N} \sum_{b} q(b) \int k(b \circ x_n, z_m)\, k(z_{m'}, b \circ x_n)\, q(x_n | b)\, dx_n. \qquad (18)$$


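For some kernels, closed-form solutions for $\psi_0$, $\Psi_1$ and $\Psi_2$ can be obtained.
For example, $\psi_0$, $\Psi_1$ and $\Psi_2$ of the exponentiated quadratic kernel are
derived as:

$$\psi_0 = N \sigma_f^2, \qquad (19)$$

$$(\Psi_1)_{nm} = \sigma_f^2 \prod_{q=1}^{Q} \left[ \frac{\gamma_q}{(s_{nq}/l_q^2 + 1)^{1/2}} e^{-\frac{1}{2} \frac{(\mu_{nq} - z_{mq})^2}{s_{nq} + l_q^2}} + (1 - \gamma_q)\, e^{-\frac{z_{mq}^2}{2 l_q^2}} \right], \qquad (20)$$

$$(\Psi_2)_{mm'} = \sigma_f^4 \sum_{n=1}^{N} \prod_{q=1}^{Q} \left[ \frac{\gamma_q}{(2 s_{nq}/l_q^2 + 1)^{1/2}} e^{-\frac{(z_{mq} - z_{m'q})^2}{4 l_q^2} - \frac{(\mu_{nq} - (z_{mq} + z_{m'q})/2)^2}{2 s_{nq} + l_q^2}} + (1 - \gamma_q)\, e^{-\frac{z_{mq}^2 + z_{m'q}^2}{2 l_q^2}} \right]. \qquad (21)$$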


Note that the distribution $q(x_{nq} | b_q = 0)$ does not need to be defined explicitly:
when the switch variable is zero, the slab variable no longer influences the likelihood, so
$q(x_{nq} | b_q = 0)$ only appears inside the KL divergence, which makes it always equal to
the prior distribution p(X).
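The closed form in (20) can be checked numerically against its definition in (17). The sketch below, written under stated assumptions, computes a single entry of Ψ_1 for the exponentiated quadratic kernel and compares it with a Monte Carlo estimate obtained by sampling the switch and slab variables. The function names, the argument `ls2` (holding squared length scales) and the parameter values are illustrative assumptions.

```python
import math
import random


def psi1_entry(mu, s, gamma, z, ls2, variance=1.0):
    """Closed-form (Psi_1)_{nm} for one data point and one inducing input.

    mu, s : means and variances of q(x_nq | b_q = 1), one per dimension
    gamma : posterior probabilities q(b_q = 1), one per dimension
    z     : inducing input z_m
    ls2   : squared length scales l_q^2 (assumed parameterisation)
    """
    out = variance
    for m_q, s_q, g_q, z_q, l2 in zip(mu, s, gamma, z, ls2):
        # Slab part: dimension switched on, Gaussian expectation of the kernel.
        slab = (g_q / math.sqrt(s_q / l2 + 1.0)
                * math.exp(-0.5 * (m_q - z_q) ** 2 / (s_q + l2)))
        # Spike part: dimension switched off, the input is exactly zero.
        spike = (1.0 - g_q) * math.exp(-0.5 * z_q ** 2 / l2)
        out *= slab + spike
    return out


def psi1_monte_carlo(mu, s, gamma, z, ls2, variance=1.0,
                     n_samples=20000, seed=0):
    """Monte Carlo estimate of the same expectation, following eq. (17)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        sq = 0.0
        for m_q, s_q, g_q, z_q, l2 in zip(mu, s, gamma, z, ls2):
            b = 1.0 if rng.random() < g_q else 0.0
            x = rng.gauss(m_q, math.sqrt(s_q))
            # When b = 0 the kernel input is zero whatever x is.
            sq += (b * x - z_q) ** 2 / l2
        total += variance * math.exp(-0.5 * sq)
    return total / n_samples


# Assumed toy values for a two-dimensional latent space.
mu, s = [0.5, -1.0], [0.2, 0.1]
gamma, z, ls2 = [0.9, 0.1], [0.3, 0.4], [1.0, 2.0]

exact = psi1_entry(mu, s, gamma, z, ls2)
approx = psi1_monte_carlo(mu, s, gamma, z, ls2)
print("closed form:", exact)
print("Monte Carlo:", approx)
```

Because the expectation factorises over latent dimensions, the closed form needs only a product of Q terms, whereas a naive evaluation of the sum over b in (17) would enumerate 2^Q switch configurations.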

4. Spike and Slab MRD 

In manifold relevance determination (Damianou et al., 2012), multiple views of data are
considered simultaneously. The model assumes that those views share some aspects and retain
some aspects that are particular to each view. Each view can both relate to the other views
through some shared latent dimensions and keep some private latent dimensions of its own
(see Fig. 1a). In the original MRD paper this effect was achieved through appropriate
sharing of ARD parameters within each view. This leads to a soft sharing approach where a
particular latent dimension can be used to a greater or lesser extent by each of the views.
Spike and slab MRD allows for probabilistic selection of binary variables to perform the
sharing (see Fig. 1b).

Figure 1. (a) The graphical model for the original MRD, where Y and Z denote two views of
the data and X denotes the latent variable. (b) The graphical model for the spike and slab
MRD.

We wish to relate C views $Y^{(c)} \in \mathbb{R}^{N \times D_c}$ of a dataset in our model.
We assume the dataset can be represented as a latent variable
$X \in \mathbb{R}^{N \times Q}$, in which a different subset of the Q latent dimensions is
used to represent each view. The selection of latent dimensions for the cth view is done by
a vector of latent binary variables $b_c \in \{0, 1\}^Q$ with the prior distribution in
(11). As mentioned in the previous section, each view can be modeled by a spike and slab GP:

$$p(Y | X, B) = \prod_{c=1}^{C} p(Y^{(c)} | X, B). \qquad (22)$$

With the prior distributions of X and B, the marginal distribution of our MRD model is

$$p(Y) = \sum_{B} p(B) \int \prod_{c=1}^{C} \prod_{d=1}^{D_c} p(y_d^{(c)} | X, b_c)\, p(X)\, dX. \qquad (23)$$

For a tractable inference algorithm, we introduce a variational approximation for the
latent variables X and B. As in the spike and slab GP-LVM, the latent variables X and B are
strongly correlated, so we define a conditional variational distribution
$q(X, B) = q(X | B) \prod_{c=1}^{C} q(b_c)$, in which there is a variational posterior
distribution for each view, representing the subset of latent dimensions used by that view,
and a posterior distribution for the latent representation of the data, which is the same
for all the views. In order to be consistent with the choices B, $q(X | B)$ is defined as

$$q(b_c) = \prod_{q=1}^{Q} \gamma_{cq}^{b_{cq}} (1 - \gamma_{cq})^{(1 - b_{cq})}, \qquad (24)$$

$$q\Big(x_{nq} \,\Big|\, \bigvee_{c=1}^{C} (b_{cq} = 1)\Big) = \mathcal{N}(x_{nq}; \mu_{nq}, s_{nq}), \qquad (25)$$

where $\bigvee$ is the or operation for binary variables. We denote the conditional
variational posterior distribution $q(x_{nq} | \bigvee_{c=1}^{C} (b_{cq} = 1))$ as
$q_c(x_{nq})$. It gives the variational posterior of X for all the views if any of the views
decides to use the latent dimension; otherwise the posterior distribution falls back to the
prior, so the posterior $q(x_{nq} | \bigwedge_{c=1}^{C} (b_{cq} = 0))$ does not need to be
explicitly defined. By variationally integrating out the latent variables X and B, we obtain
a lower bound for our MRD model:

$$p(Y) = \sum_{B} p(B) \int \prod_{c=1}^{C} p(Y^{(c)} | X, b_c)\, p(X)\, dX, \qquad (26)$$

$$\log p(Y) \geq \sum_{c=1}^{C} \sum_{d=1}^{D_c} F_d^{(c)}(q) - \mathrm{KL}(q(B, X) \| p(B) p(X)), \qquad (27)$$

$$\mathrm{KL}(q(B, X) \| p(B) p(X)) = \sum_{B} \int q(B, X) \log \frac{q(B, X)}{p(B) p(X)}\, dX. \qquad (28)$$

With this definition, we have new $\Psi_1$ and $\Psi_2$ correspondingly,

$$(\Psi_1^{(c)})_{nm} = \sum_{B} \prod_{c'=1}^{C} q(b_{c'}) \int k(b_c \circ x_n, z_m)\, q(x_n | B)\, dx_n, \qquad (29)$$

$$(\Psi_2^{(c)})_{mm'} = \sum_{n=1}^{N} \sum_{B} \prod_{c'=1}^{C} q(b_{c'}) \int k(b_c \circ x_n, z_m)\, k(z_{m'}, b_c \circ x_n)\, q(x_n | B)\, dx_n; \qquad (30)$$

however, they lead to exactly the same formulas as (19), (20) and (21), evaluated with the
switch probabilities of the view in question. For an efficient implementation, the
computation of $\psi_0$, $\Psi_1$ and $\Psi_2$, which is usually the bottleneck, can easily
be parallelized by dividing the data points into small groups and evaluating the results in
a distributed way (Dai et al., 2014; Gal et al., 2014).
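The parallelisation over data points works because Ψ_2 is a sum of independent per-datapoint terms. The sketch below illustrates this for the exponentiated quadratic kernel: partial Ψ_2 matrices are computed for two groups of data points and then added, giving the same result as a single pass. It is a sequential illustration only; the function names, the squared length-scale argument `ls2` and the toy values are assumptions, and a real implementation would dispatch the groups to separate workers or GPU threads.

```python
import math


def psi2_partial(mu, s, gamma, Z, ls2, variance=1.0):
    """Contribution of a group of data points to Psi_2 (M x M nested list).

    mu, s : per-point lists of slab means and variances (each of length Q)
    gamma : posterior switch probabilities, one per latent dimension
    Z     : inducing inputs (M lists of length Q)
    ls2   : squared length scales l_q^2 (assumed parameterisation)
    """
    M, Q = len(Z), len(gamma)
    out = [[0.0] * M for _ in range(M)]
    for mu_n, s_n in zip(mu, s):
        for m in range(M):
            for m2 in range(M):
                prod = variance ** 2  # sigma_f^4 when variance = sigma_f^2
                for q in range(Q):
                    l2 = ls2[q]
                    zbar = 0.5 * (Z[m][q] + Z[m2][q])
                    slab = (gamma[q] / math.sqrt(2.0 * s_n[q] / l2 + 1.0)
                            * math.exp(-(Z[m][q] - Z[m2][q]) ** 2 / (4.0 * l2)
                                       - (mu_n[q] - zbar) ** 2
                                       / (2.0 * s_n[q] + l2)))
                    spike = ((1.0 - gamma[q])
                             * math.exp(-(Z[m][q] ** 2 + Z[m2][q] ** 2)
                                        / (2.0 * l2)))
                    prod *= slab + spike
                out[m][m2] += prod
    return out


def add_matrices(A, B):
    """Element-wise sum of two nested-list matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Assumed toy problem: N = 4 data points, Q = 2, M = 3 inducing inputs.
mu = [[0.5, -1.0], [0.0, 0.3], [1.2, 0.1], [-0.7, 0.4]]
s = [[0.2, 0.1], [0.05, 0.3], [0.1, 0.1], [0.4, 0.2]]
gamma = [0.9, 0.2]
Z = [[0.0, 0.0], [1.0, -0.5], [-1.0, 0.5]]
ls2 = [1.0, 2.0]

full = psi2_partial(mu, s, gamma, Z, ls2)
chunked = add_matrices(psi2_partial(mu[:2], s[:2], gamma, Z, ls2),
                       psi2_partial(mu[2:], s[2:], gamma, Z, ls2))
max_diff = max(abs(a - b) for ra, rb in zip(full, chunked)
               for a, b in zip(ra, rb))
print("largest difference between full and chunked Psi_2:", max_diff)
```

Only the M x M partial sums need to be communicated between workers, so the cost of combining the results does not grow with the number of data points.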



Figure 2. (a) The three latent signals used for generating observed data. (b) Two observed
data sets generated from the latent signals. The y-axis shows the different dimensions of
the observed data, 12 dimensions for each. The x-axis shows the data in each dimension; 50
samples are drawn evenly.


5. Experiments 

We first demonstrate our model with synthetic data. We aim to recover the latent signals
from two sets of multi-dimensional observed data. Then, we show the effectiveness of the
switch variable as a cue for choosing latent dimensions. After that, we apply our SSMRD
model to a text-image dataset, where our model gives significantly better results than the
state of the art.

5.1. Synthetic Data 

Figure 3. (a) The recovered latent signals according to our SSGP-LVM model from the first
signal in Fig. 2(b). It shows the conditional variational posterior distribution $q_c(X)$.
Each row corresponds to a latent dimension; the curve shows the mean and the width of the
colored region shows the variance. (b) The learned posterior probability for the switch
variable b. (c) The learned length scales of latent dimensions.

We first apply both of our models to synthetic data, where the nature of the latent
representation is known and we can ascertain when it is correctly reconstructed. We
introduced three artificial signal sources, $y = \sin(x)$, $y = -\exp(-\cos(2x))$ and
$y = \cos(x)$, and drew 50 samples evenly from the interval between 0 and $2\pi$ (see Fig.
2(a)). The drawn samples are normalized to zero mean and unit variance. These are the latent
signals that we would like to recover. We generated two sets of observed signals. For the
first observed signal, we combined the 1st and 3rd signals and transformed them into a 12
dimensional signal through a random linear transformation. The second observed signal is
generated in the same way by combining the 2nd and 3rd signals (see Fig. 2(b)).
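The construction of the synthetic data described above can be sketched as follows. Several details are assumptions made for illustration because the text does not specify them: the grid includes both end points of the interval, the random linear transformation has standard Gaussian weights, no observation noise is added, and the helper names are hypothetical.

```python
import math
import random


def standardize(v):
    """Rescale a signal to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var) for a in v]


def random_linear_map(signals, out_dim, rng):
    """Map latent signals (each of length N) to an N x out_dim observation."""
    # Assumed: one standard Gaussian weight per (signal, output dimension).
    W = [[rng.gauss(0.0, 1.0) for _ in range(out_dim)] for _ in signals]
    N = len(signals[0])
    return [[sum(sig[n] * W[k][d] for k, sig in enumerate(signals))
             for d in range(out_dim)] for n in range(N)]


rng = random.Random(0)
N = 50
xs = [2.0 * math.pi * i / (N - 1) for i in range(N)]

# The three latent signal sources, normalized as in the text.
s1 = standardize([math.sin(x) for x in xs])
s2 = standardize([-math.exp(-math.cos(2.0 * x)) for x in xs])
s3 = standardize([math.cos(x) for x in xs])

# View 1 mixes signals 1 and 3; view 2 mixes signals 2 and 3.
Y1 = random_linear_map([s1, s3], 12, rng)
Y2 = random_linear_map([s2, s3], 12, rng)
print(len(Y1), "x", len(Y1[0]), "and", len(Y2), "x", len(Y2[0]))
```

By construction the third signal is shared between the two observed sets, while the first and second signals are private to one set each, which is the structure the multi-view model is expected to recover.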

We first applied the spike and slab GP-LVM model to the first observed signal to see whether
it can recover the latent signals and determine the used latent dimensions when offered more
latent dimensions than the number of underlying signals. We applied our model with a linear
kernel and 5 latent dimensions. The recovered latent signals are shown in Fig. 3(a) and the
posterior probabilities of the switch variables in Fig. 3(b). The learned 1st and 2nd latent
dimensions successfully recover the latent signals with very small posterior variances, and
the posterior probabilities of these two latent dimensions are close to 1 while the rest are
close to zero, which means the model only used the first two latent dimensions to explain
the data. The length-scale parameters are plotted in Fig. 3(c), where there is a big
difference between the used and unused dimensions, which matches the observations with the
Bayesian GP-LVM. This perfectly matches the information that we put into the data.

We then applied the spike and slab MRD model to both sets of observed signals, taking each
observed signal set as a different view of the data. The way the data was generated implies
that each view has one private latent signal and also shares a common latent signal. We aim
to recover all the latent signals with a correct assignment of latent dimensions. We applied
a linear kernel for each view and 5 latent dimensions. The recovered latent signals are
shown in Fig. 4(a) and the posterior probabilities of the switch variables in Fig. 4(b). All
three latent signals are recovered with very small posterior variances. The 1st view takes
the 1st latent dimension as its private space, which recovers its private latent signal, and
the 2nd view takes the 4th latent dimension for its private signal, while the 5th latent
dimension is shared by both views and recovers the shared latent signal. The 2nd latent
dimension is used by both views to model some structured noise, for which the inferred
variance is significantly higher than for the true signals. The inferred length scales for
the kernels of both views are shown in Fig. 4(c). Note that it is not immediately clear how
to threshold the latent dimensions according to the length-scale parameters. For instance,
for the 1st view, the length scales of the 1st and 5th dimensions are roughly at the same
level, which corresponds to the true signals, while the 2nd and 4th dimensions also have
relatively low length scales. However, according to the posterior probabilities of its
switch variable, the 2nd dimension is used by the model while the 4th dimension is the
private space of the other view. In this case, thresholding the length-scale parameters
cannot give the same answer as the posterior probabilities of the switch variable.

Figure 4. (a) The recovered latent signals according to our SSMRD model by assigning two
views to the two observed data sets respectively. (b) The learned posterior probability for
the switch variable b. Different colors correspond to different views. (c) The learned
length scales of latent dimensions.

5.2. Classification data 

We next considered a data set of handwritten digits to quantify the number of latent
dimensions required to represent the digits. We took a subset of images from the MNIST
dataset (LeCun et al., 1998). We chose the images of the digits “1”, “7” and “9”, and took
1,000 images of each digit for training and 1,000 for testing.

We applied the spike and slab GP-LVM to the training set for dimensionality reduction, where
the initial number of latent dimensions was chosen to be 20. The optimization of our model
parameters was done in a purely unsupervised fashion, where no label information was used.
The resulting length scales of the latent dimensions are shown in Fig. 5(a), where the
latent dimensions are sorted according to their length scales. The posterior probabilities
of the switch variable are shown in Fig. 5(b). From these values we can see that the model
actively makes use of 10 latent dimensions.

After optimizing the model parameters, we use the learned model to infer the conditional
variational posterior distributions of test images $q_c(X_t)$, which encode the posterior
mean and variance of a datapoint in latent space if the corresponding latent dimensions are
used. Afterwards, we apply a nearest neighbor classifier, which compares the posterior means
of training and testing data points and predicts the label of a test image according to the
label of its nearest training image in the latent space. We test the classification accuracy
for different latent space configurations. We chose the latent dimensions by thresholding
their length scales at different values, by which we obtained 20 different choices of latent
space. We compared the performance of these choices of latent space with choosing the latent
space according to the posterior probability of the switch variable, γ. The comparison is
shown in Fig. 5(c), where the numbers on the x-axis denote the choices corresponding to
different numbers of used latent dimensions, and γ denotes the choice according to the
parameter γ.

By comparing Fig. 5(a) and Fig. 5(c), we see that the best classification accuracy can be
obtained by using only 6 latent dimensions, but we do not see a significant change in length
scale between the 6th and 7th latent dimensions. A tentative human choice is to use 7 latent
dimensions, which gives a similar level of performance, but it is difficult to automate this
kind of decision. On the other hand, the choice according to γ has a slightly higher number
of dimensions, but it is trivial to make an automatic decision based on it.
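A minimal sketch of the automatic decision based on γ, followed by nearest neighbor classification on the posterior means, is given below. The cut-off of 0.5 on the posterior switch probability is an assumption (the text does not state one), and the function names and toy numbers are hypothetical; they are not the MNIST results.

```python
def select_dimensions(gamma, threshold=0.5):
    """Indices of latent dimensions whose switch probability exceeds a cut-off.

    The default cut-off of 0.5 is an assumption made for this sketch.
    """
    return [q for q, g in enumerate(gamma) if g > threshold]


def nearest_neighbor_predict(train_means, train_labels, test_means, dims):
    """Label each test point by its nearest training point over `dims`."""
    predictions = []
    for x in test_means:
        best_label, best_dist = None, float("inf")
        for y, label in zip(train_means, train_labels):
            dist = sum((x[q] - y[q]) ** 2 for q in dims)
            if dist < best_dist:
                best_label, best_dist = label, dist
        predictions.append(best_label)
    return predictions


# Toy posterior quantities (assumed): the second dimension is switched off
# and its posterior means carry no class information.
gamma = [0.99, 0.02, 0.97]
train_means = [[0.0, 5.0, 0.0], [1.0, -5.0, 1.0]]
train_labels = [1, 7]
test_means = [[0.1, -4.0, 0.2], [0.9, 6.0, 0.8]]

dims = select_dimensions(gamma)
pred_selected = nearest_neighbor_predict(train_means, train_labels,
                                         test_means, dims)
pred_all = nearest_neighbor_predict(train_means, train_labels,
                                    test_means, [0, 1, 2])
print("selected dimensions:", dims)
print("predictions with selected dimensions:", pred_selected)
print("predictions with all dimensions:", pred_all)
```

In this toy example the switched-off dimension dominates the distance when all dimensions are used and flips both predictions, while restricting the comparison to the selected dimensions gives the intended labels.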



Figure 6. The precision-recall curve for both image and text 
queries on the Wiki dataset. 


5.3. Text-Image Retrieval 

An interesting application of MRD models is to relate information from different domains.
For example, relating images to text can potentially resolve ambiguities that arise when
looking at only a single view of the data. The image representations of an object can vary a
lot due to changes in location, viewing angle, illumination conditions, etc. Purely from
image data, it might be difficult to recognise different variants of the same object, but
with text such ambiguities may be resolved. Similarly, image representations can help to
resolve ambiguities in text, e.g., a facial image can easily distinguish different people
with the same name.

We show results on a text-image dataset collected from Wikipedia (Costa Pereira et al.,
2014). The task here is to perform multimedia information retrieval, i.e., given a text
query, the algorithm needs to produce a ranking of the images in the training set, and
similarly, given an image query, to produce a ranking of the texts. The Wikipedia dataset
consists of 2173/693 (training/testing) image-text pairs associated with 10 different
topics. We used the features for images and texts provided by the authors: a 10D text
feature extracted from an LDA model for each document and a 128D SIFT histogram image
feature for the corresponding image. The quality of the inferred ranking is assessed in
terms of mean Average Precision (mAP) and precision-recall curves.
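For reference, mean Average Precision can be computed from ranked relevance flags as in the sketch below. It assumes that a retrieved item counts as relevant when it shares the topic of the query, which is an assumption of this example rather than a statement of the evaluation protocol; the function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of 0/1 relevance flags."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over all queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two toy queries: flags are listed in ranked order, 1 meaning relevant.
rankings = [[1, 0, 1, 0], [0, 1]]
ap_first = average_precision(rankings[0])
ap_second = average_precision(rankings[1])
map_score = mean_average_precision(rankings)
print(ap_first, ap_second, map_score)
```

For the first query the relevant items sit at ranks 1 and 3, giving precisions 1 and 2/3 and an average precision of 5/6; the second query has its only relevant item at rank 2, giving 1/2.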

We applied our SSMRD model by assigning image features to one view and text features to
another. Both image and text features are normalized to zero mean and unit variance before
inference. To compensate for the unbalanced dimensionality between image and text features
(128 vs. 10), we replicate the text features 6 times, making a 60D representation. An
exponentiated quadratic kernel is used for each view and the number of latent dimensions is
chosen to be 10. We initialize the mean of the variational posterior, μ, according to the
topic of each image-text pair by placing the pairs on the vertices of a simplex structure.
The rest of learning runs without taking the topic information into account. The model
learns a shared latent space between images and texts as well as their private latent
spaces. After optimizing all the model parameters, the query results are produced by first
inferring the conditional variational posterior distribution $q(X_* | Y_*^{(c)})$ given the
query input, i.e., image or text features $Y_*^{(c)}$, and then ranking the training data
according to the Euclidean distance in the shared latent space. The quality of the produced
rankings is evaluated in terms of precision-recall curves (see Fig. 6) and mAP (see Tab. 1).

Figure 5. (a) The learned length scales in our SSGP-LVM model from a subset of the MNIST
dataset containing the digits “1”, “7” and “9” (sorted according to the length scale
values). (b) The learned posterior probability of the switch variable, γ. (c) The
classification accuracy for different choices of latent dimensions according to their length
scales, where the bar labelled γ denotes the performance of choosing latent dimensions
according to γ.

Table 1. Mean Average Precision (mAP) Scores

          img. query   txt. query   avg.
SSMRD     0.170        0.540        0.355
SCM       0.362        0.273        0.318
SM        0.350        0.249        0.300
CM        0.267        0.219        0.243
GMMFA     0.264        0.231        0.248
GMLDA     0.272        0.232        0.253

The performance of state-of-the-art algorithms is also shown in Tab. 1. All the results used
the same feature set provided with the dataset. Correlation matching (CM), semantic matching
(SM), and semantic correlation matching (SCM) are the methods proposed by the creators of
the dataset (Costa Pereira et al., 2014), of which SCM gives the state-of-the-art
performance. The results for generalized multi-view analysis (GMA) with LDA (GMLDA) and
marginal Fisher analysis (GMMFA) are reported by Sharma et al. (2012). The mAP measures are
taken directly from their papers. Our performance for text queries is significantly better
than all the state-of-the-art algorithms, and its precision-recall curve drops much more
slowly than those reported in (Costa Pereira et al., 2014). Our performance for image
queries is below the state of the art. We suspect this is due to an insufficient number of
inducing inputs (100 are used for the reported performance), which directly limits the
modeling capability, and to not introducing label information into learning. Note that all
their algorithms except CM are supervised, while our model does not make use of label
information during training except for initializing the latent space.

6. Conclusion 

Standard approaches to variable selection in Gaussian process latent variable models have
relied on scaling priors to reduce the influence of particular dimensions. We have
introduced a switch variable and a spike and slab prior, which allow us to explicitly model
the switching on and off of particular latent dimensions. This provides a more principled
approach for selecting latent dimensions. By variationally integrating out the spike and
slab latent variable we derived a lower bound on the log marginal likelihood. For efficient
implementation we used a parallel version of the algorithm with GPU acceleration. In the
GP-LVM, multiple view learning is achieved through manifold relevance determination, where
the choices of latent dimensions for different views are explicitly modelled. We also
applied the spike and slab approach to the MRD prior and were able to show significantly
better than state-of-the-art performance on a cross-modal multimedia retrieval task.

Structural learning in Gaussian process models is becoming more important with the advent of
deep Gaussian processes (Damianou & Lawrence, 2013). We envisage that the combination of
spike and slab models with appropriate Indian buffet process (IBP) priors (Griffiths &
Ghahramani, 2005) will enable structural learning of composite process models.